Take-home Exercise 3

Visual Analytics of Resale Prices of Singapore Public Housing Properties

Author

Michael Djohan

Published

February 5, 2023

Modified

February 11, 2023

1. Overview

This exercise aims to uncover the salient patterns of the resale prices of public housing property by residential towns and estates in Singapore using appropriate analytical visualisation techniques. The visualization is designed using ggplot2, its extensions, and tidyverse packages.

The original dataset was downloaded from Data.gov.sg titled Resale flat princes based on registration date from Jan-2017 onwards.

The file downloaded was resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv

The focus of the study is on 3-ROOM, 4-ROOM and 5-ROOM types for 2022 period.

2. Data Preparation

2.1 Install R packages and import dataset

The code chunk below uses pacman::p_load() to check if packages are installed. If they are, they will be launched into R. The packages installed are

  • plotly: Used for creating interactive web-based graphs.

  • ggstatsplot: Used for creating graphics with details from statistical tests.

  • knitr: Used for dynamic report generation

  • pacthwork: Used to combine plots

  • ggdist: Used for visualising distribution and uncertainty

  • ggthemes: Provide additional themes for ggplot2

  • tidyverse: A collection of core packages designed for data science, used extensively for data preparation and wrangling.

    All packages can be found within CRAN.

Import data from csv using readr::read_csv() and store it in variable flatprice.

#Load packages
pacman::p_load(plotly, ggstatsplot, knitr, patchwork, ggdist, ggthemes, tidyverse)

#Import data
flatprice <- read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv", show_col_types = FALSE)

2.2 Data wrangling

Looking at the data below, we notice few problems

  • month is in <chr> format (“yyyy-mm”), which is not very useful for filtering for 2022 period

  • lease_commence_date is in <dbl> format. It needs to be converted to <chr>

  • remaining_lease is in <chr> format. It needs to be reformatted to <dbl> in years

flatprice
# A tibble: 146,215 × 11
   month   town    flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
   <chr>   <chr>   <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
 1 2017-01 ANG MO… 2 ROOM  406   ANG MO… 10 TO …      44 Improv…    1979 61 yea…
 2 2017-01 ANG MO… 3 ROOM  108   ANG MO… 01 TO …      67 New Ge…    1978 60 yea…
 3 2017-01 ANG MO… 3 ROOM  602   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
 4 2017-01 ANG MO… 3 ROOM  465   ANG MO… 04 TO …      68 New Ge…    1980 62 yea…
 5 2017-01 ANG MO… 3 ROOM  601   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
 6 2017-01 ANG MO… 3 ROOM  150   ANG MO… 01 TO …      68 New Ge…    1981 63 yea…
 7 2017-01 ANG MO… 3 ROOM  447   ANG MO… 04 TO …      68 New Ge…    1979 61 yea…
 8 2017-01 ANG MO… 3 ROOM  218   ANG MO… 04 TO …      67 New Ge…    1976 58 yea…
 9 2017-01 ANG MO… 3 ROOM  447   ANG MO… 04 TO …      68 New Ge…    1979 61 yea…
10 2017-01 ANG MO… 3 ROOM  571   ANG MO… 01 TO …      67 New Ge…    1979 61 yea…
# … with 146,205 more rows, 1 more variable: resale_price <dbl>, and
#   abbreviated variable names ¹​flat_type, ²​street_name, ³​storey_range,
#   ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

is.na() function is also used to confirm that there are no missing values in the flatprice dataset.

#Check for missing values
any(is.na(flatprice))
[1] FALSE

The code chunk below performs the required data wrangling to clean flatprice dataset and store it in new variable flatpriceclean.

  • Filter flat_type for 3 ROOM, 4 ROOM, and 5 ROOM as this is the scope of the study using dplyr::filter()
  • Convert the month variable to date using as.Date(). Store the year and month to the respective new variables year and month using dplyr:mutate(). They can then be converted to string (discrete variables) using as.character(). Afterwards, we can filter the year variable to 2022 using dplyr::filter(), which is the scope of the study
  • Extract the year and month digits from remaining_lease variable using str_extract() function. Sum the year digit and (month digit/12) to obtain the years of remaining lease and convert it to <dbl> format using as.numeric(). They are then rounded to 1 decimal place using round(). Assign new variable called remaining_lease_years using dplyr::mutate()
  • Create new variable called resale_price_persqm to divide the resale_price with floor_area_sqm. This is performed to normalize the resale price to flat area. They are then rounded to 1 decimal place using round(). The new variable is assigned using dplyr::mutate
  • Convert lease_commence_date to <int> using as.integer()
#Data preparation
#store the new dataset in new variable flatpriceclean
flatpriceclean <- flatprice |> 
  
#filter for 3-ROOM, 4-ROOM, 5-ROOM
  filter(flat_type %in% c('3 ROOM','4 ROOM','5 ROOM')) |> 
  
#reformat month and split it to month and year. Use year to filter for 2022
  mutate(year = as.character(format(as.Date(paste(month, "-01", sep="")), "%Y")),
         month = as.character(format(as.Date(paste(month, "-01", sep="")), "%m")),
         .before = 1)|> 
  filter(year == 2022) |> 

#mutate remaining_lease to remaining_lease_years
  mutate(remaining_lease_years = round((as.numeric(str_extract(remaining_lease, "^[0-9]+")) + 
                           ifelse(is.na(as.numeric(str_extract(remaining_lease, " [0-9]+"))), 0, as.numeric(str_extract(remaining_lease, " [0-9]+")))/12), digits = 1),
         
#create new variable called resale_price_persqm
         resale_price_persqm = round(resale_price/floor_area_sqm, digits = 1),

#convert lease_commence_date to integer
         lease_commence_date = as.integer(lease_commence_date),
         .after = remaining_lease) 

The final dataset flatpriceclean is displayed below.

kable(head(flatpriceclean), "simple")
year month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date remaining_lease remaining_lease_years resale_price_persqm resale_price
2022 01 ANG MO KIO 3 ROOM 320 ANG MO KIO AVE 1 07 TO 09 73 New Generation 1977 54 years 05 months 54.4 4904.1 358000
2022 01 ANG MO KIO 3 ROOM 225 ANG MO KIO AVE 1 07 TO 09 67 New Generation 1978 55 years 01 month 55.1 5298.5 355000
2022 01 ANG MO KIO 3 ROOM 331 ANG MO KIO AVE 1 07 TO 09 68 New Generation 1981 58 years 58.0 4970.6 338000
2022 01 ANG MO KIO 3 ROOM 534 ANG MO KIO AVE 10 07 TO 09 82 New Generation 1980 57 years 02 months 57.2 5122.0 420000
2022 01 ANG MO KIO 3 ROOM 578 ANG MO KIO AVE 10 04 TO 06 67 New Generation 1980 57 years 01 month 57.1 4895.5 328000
2022 01 ANG MO KIO 3 ROOM 452 ANG MO KIO AVE 10 01 TO 03 83 New Generation 1979 56 years 07 months 56.6 4337.3 360000

3. Visualisation

3.1 Exploratory Data Visualisation

The plots here are preliminary in nature but designed with interactivity to allow users to perform Exploratory Data Analysis (EDA) Visualisation to study the data. Hence plotly is mainly used.

3.1.1. Interactive scatterplot

Design Consideration

Scatterplots are generally used to discover relationship between two continuous variables. As such, the visualization below allows users to select the x-axis and y-axis of the continuous variables they wish to study. Considerations :

  • Dropdown menu to select variables for y-axis and x-axis

  • y-axis selection is resale_price and resale_price_persqm. This is aligned with the study purpose of discovering patterns of resale price. This allows users to also see the intent of normalising resale_price by floor_area_sqm

  • x-axis selection is other continuous variables, namely: remaining_lease_years, lease_commence_date, and floor_area_sqm

  • As the plots are expected to be very cluttered, opacity is introduced with white border

  • Tooltip indicating variables of interest : remaining_lease_years, floor_area_sqm, and resale_price_persqm

Preparation of visualisation

plot_ly is used to prepare the interactive plot. Steps taken are

  • Initiating base scatterplot, indicated by type = 'scatter'. marker argument is used to introduce opacity and line (white plot border)

  • Tooltip customisation is created using hovertemplate argument

  • layout argument is used to add plot title, x-axis title, and y-axis title

  • To create the dropdown menu for parameters of x-axis and y-axis, the updatemenus argument is used to create respective buttons

Refer to code below for more details

Show the code
#Initiating the base plot
plot_ly(data = flatpriceclean,
        x = ~remaining_lease_years,
        y = ~resale_price_persqm,
        hovertemplate = ~paste("<br>Resale Price per sqm:", resale_price_persqm,
                               "<br>Floor Area (sqm):", floor_area_sqm,
                               "<br>Remaining Lease (Year):", remaining_lease_years),
        type = 'scatter',
        mode = 'markers',
        marker = list(opacity = 0.6,
                      sizemode = 'diameter',
                      line = list(width = 0.2, color = 'white'))) |> 

#Generating plot, x-axis, and y-axis title
  layout(title = "Interactive scatterplot of resale price vs selected factors\nResale transactions, 2022",
         xaxis = list(title = "Remaining Lease (Year)"),
         yaxis = list(title = "Resale Price per sqm (SGD)"),
         
#creating dropwdown menus to allow selection of parameters on x-axis and y-axis 
         updatemenus = list(list(type = "dropdown",
                                 direction = "up",
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 yanchor = "top",
                                 x = 1,
                                 y = 0,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$remaining_lease_years)),
                                                    list(xaxis = list(title = "Remaining Lease (Year)"))),
                                        label = "Remaining Lease"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$lease_commence_date)),
                                                    list(xaxis = list(title = "Year of Lease Commenced"))),
                                        label = "Lease Commenced"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$floor_area_sqm)),
                                                    list(xaxis = list(title = "Floor Area (sqm)"))),
                                        label = "Floor Area")
                                   )
                                 ),
                            
                            list(type = "dropdown",
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 yanchor = "top",
                                 x = 0.04,
                                 y = 0.95,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(y = list(flatpriceclean$resale_price_persqm)),
                                                    list(yaxis = list(title = "Resale Price per sqm (SGD)"))),
                                        label = "Resale Price/Area"),
                                   list(method = "update",
                                        args = list(list(y = list(flatpriceclean$resale_price)),
                                                    list(yaxis = list(title = "Resale Price (SGD)"))),
                                        label = "Resale Price")
                                   )
                                 )
                            )
         )
Insights
  • The most obvious pattern is the direct positive correlation between resale price and floor area. Hence the decision to normalize the resale price is justified. From here onwards, we will focus more on resale price per sqm.

  • Plotting resale price per sqm vs remaining lease reveals somewhat positive exponential correlation. There seems to sharper increase in gradient if the remaining lease is beyond 80 years

  • As expected, the same pattern is revealed when resale price per sqm is plotted against the year of lease commenced. In fact the year of lease commenced and remaining lease are directly correlated to each other as shown in the plot below. Hence it makes sense to use one of them in future analysis. In this case, we will only use remaining lease as it is more intuitive.

Show the code
ggplot(data = flatpriceclean)+

  geom_point(aes(x = lease_commence_date,
                 y = remaining_lease_years)) +
  
  labs(
    x = "Year of Lease Commenced",
    y = "Remaining Lease\n(Year)") +

  theme(axis.title.y = element_text(angle = 0))

3.1.2. Interactive boxviolin plot

Design Consideration

In contrast to scatterplots, boxplots are generally used to discover relationship between continuous variables and discrete variables. Violin plots can also be added to visualize the kernel density of the data.

As such, the visualization below allows users to select the x-axis of the discrete variables they wish to study. Considerations :

  • Dropdown menu to select variables for x-axis
  • y-axis selected is resale_price_persqm. This is aligned with the study purpose of discovering patterns of resale price
  • x-axis selection is the discrete variables, namely: flat_type, flat_model, storey_range, town, and month. Note that month (transaction month) is treated as discrete variable as we want to find out any seasonal variations (statistically significant difference in mean/median) of price.
  • x-axis title is omitted as it will be displayed by the selected dropdown menu
  • Combining boxplots and violin plots to visualize the quartiles and kernel density of the distribution
  • Tooltip to show mean, quartiles, and outliers
  • Plot can be zoomed-in or out as it might be difficult to get details with large number of categories (i.e., town)
  • Order the variables by ascending order (mean) if they are not lexicographically important (i.e., flat_type, storey_range, and month
Preparation of visualisation

plot_ly is used to prepare the interactive plot. Steps taken are

  • Initiating base scatterplot, indicated by type = 'violin'. Note box visible argument is set to T to ensure the boxplot is visible. The same is applied to meanline visible to ensure meanline is visible.

  • layout argument is used to add plot title, and y-axis title

  • To create the dropdown menu for parameters of x-axis, the updatemenus argument is used to create respective buttons

  • categoryorder is set for each button to determine the order of discrete variables. category ascending will ensure lexicographic order while mean ascending will ensure mean order.

Refer to code below for more details

Show the code
#Initiating the base plot
plot_ly(data = flatpriceclean,
        x = ~flat_type,
        y = ~resale_price_persqm,
        type = "violin",
        marker = list(line = list(width = 2)),
        box = list(visible = T),
        meanline = list(visible = T)) |> 
  
#Generating plot and y-axis title  
  layout(title = "Distribution of resale price by selected factors \nResale transactions, 2022",
         xaxis = list(title = ""),
         yaxis = list(title = "Resale Price per sqm (SGD)"),
         
#creating dropwdown menus to allow selection of parameters on x-axis 
         updatemenus = list(list(type = 'dropdown',
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 x = 0.04, 
                                 y = 0.95,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$flat_type)),
                                                    list(xaxis = list(categoryorder = "category ascending"))),
                                        label = "Flat Type"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$flat_model)),
                                                    list(xaxis = list(categoryorder = "mean ascending"))),
                                        label = "Flat Model"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$storey_range)),
                                                    list(xaxis = list(categoryorder = "category ascending"))),
                                        label = "Storey Height"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$town)),
                                                    list(xaxis = list(categoryorder = "mean ascending"))),
                                        label = "Town"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$month)),
                                                    list(xaxis = list(tickmode = "array")),
                                                    list(color = list(flatpriceclean$month))),
                                        label = "Transaction Month")
                              
                                   )
                                 )
                            )
         )
Insights

Note that insights can be garnered by zooming in some plots with high number of categories

  • When plotted by flat_type, preliminary EDA reveals minimal difference between the means/medians. This is expected since we are using resale_price_persqm. Note that flats with more rooms tend to be bigger in area, hence when normalized by area, the difference in resale price among different types of flat is minimized. It is interesting to note that 5 ROOM flat type has the lowest mean overall.

  • When plotted by flat_model, preliminary EDA reveals higher mean/median resale_price_persqm for models like DBSS, Terrace, Premium, S2 and S1. However, it is important to note that these types tend to be few in the market as well, which likely forms the more exclusive models. Note large variance are observed for improved and Model A types.

  • When plotted by storey_range, preliminary EDA reveals higher mean/median resale_price_persqm for higher storey. Note large variance are observed for 19 to 27th storey units.

  • When plotted by town, preliminary EDA reveals differences in mean/median resale_price_persqm for different town estates. At the lower end, we can find newly developed or further towns (i.e., Woodlands, Pasir Ris, Jurong West), while at the opposite end, we find more mature or centrally located towns (i.e., Central Area, Queenstown, Kallang/Whampoa). Note that the data looks skewed right with most of the mean is higher than the median and high presence of outlier on the higher end.

  • Preliminary EDA reveals minimal difference in resale_price_persqm between transaction months.

However, these observations need to be confirmed using appropriate hypothesis techniques. Since the focus of the study is on resale price by residential towns/estates, only this will be considered.

3.2 Confirmatory Data Analysis Visualisation

The plots here are focused in visual statistical testing used in Confirmatory Data Analysis. Hence ggstatsplot is mainly used.

Design Consideration

To ensure that best practices (APA standard) of statistical reporting is followed, ggstatsplot library is used in this section. Considerations :

  • As revealed by EDA, town variables have 26 categories. Hence ggbetweenstats will be used for one-way ANOVA testing.

  • Median will be used instead of mean as point estimates due to high presence of outliers and skewness of data.

  • Since we are unable to confirm normality assumption, non-parametric test (Kruskal-Wallis) will be used for hypothesis testing. Note that the null hypothesis is no difference between median resale price per area by towns/estates.

  • Given the large number of categories, two plots are considered, splitting the towns seeing >= 850 transactions and those seeing <850 transactions in 2022. This is mainly to ensure that the selected towns can fit in single plot (13 towns each) as having 26 towns in one single plot causes it to be too cluttered. This also helps to reduce bias against variations in number of observations. Note that the y-axis range needs to be fixed for the two plots for easier comparison.

Preparation of visualisation

ggstatsplot is used to prepare the plot. Steps taken are

  • Order the towns by the median of resale_price_persqm in ascending order using forcats::fct_reorder. This will ensure the towns are ordered accordingly for easier observation.

  • ggstatsplot::ggbetweenstats() is used to display the one-way ANOVA plot. Note that the dataset is grouped by town and filtered when count >=850. This is achieved using dplyr::group_by() and dplyr::filter()

  • As above design consideration, non-parametric test is considered hence type = "np"

  • Comparisons are made pairwise between non-significant pairs as they are the points of interest. Note that the majority of the pairs comparison are significant as well, hence it will clutter the plots unnecessarily. The argument pairwise.display = "ns"

  • Two plots are created and assigned to p1 and p2 respectively

  • Two sets of color are manually provided using scale_color_manual of ggplot2. This is because the basic palette is not able to support large number of colors (26)

  • scale_y_continuous is used to set the y-axis range between 4,000 and 20,000

  • patchwork is added to combine the two plots using operator /

  • plot_annotation() is used to configure the title, subtitle and caption of the combined visualisation

Refer to code below for more details

Show the code
#Order the town variables by median of resale price
flatpriceclean$town <- fct_reorder(flatpriceclean$town, flatpriceclean$resale_price_persqm, median)

#set the color pallete
col13_1 <- c("dodgerblue2", "#E31A1C", 
  "green4",
  "#6A3D9A", 
  "#FF7F00", 
  "gray30", "gold1",
  "skyblue2", "#FB9A99", 
  "palegreen2",
  "#CAB2D6", 
  "#FDBF6F", 
  "gray80") 
  
col13_2 <- c("khaki2",
  "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown", "gray50")

#Initiating the base plot
p1 <- ggbetweenstats(
  data = flatpriceclean|> 
    group_by(town) |> 
    filter(n() >= 850),
  x = town, 
  y = resale_price_persqm, 
  ylab = "Resale Price per sqm (SGD)",
  title = "Plot 1. One-way ANOVA (towns >= 850 transactions)",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns",
  mean.ci = TRUE,
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) +
  scale_color_manual(values = col13_1) +
  theme(axis.title.x = element_blank()) +
  scale_y_continuous(limits = c(4000, 20000))

p2 <- ggbetweenstats(
  data = flatpriceclean|> 
    group_by(town) |> 
    filter(n() < 850),
  x = town, 
  y = resale_price_persqm, 
  xlab = "Planning Town",
  ylab = "Resale Price per sqm (SGD)",
  title = "Plot 2. One-way ANOVA (towns < 850 transactions)",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns",
  mean.ci = TRUE,
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) +
  scale_color_manual(values = col13_2) +
  scale_y_continuous(limits = c(4000, 20000))

final_plot <- p1/p2
final_plot + plot_annotation(
  title = 'One-way ANOVA reveals differences in 2022 resale price across different towns',
  subtitle = 'Resale price per flat area by Town, 2022',
  caption = 'Resale flat princes based on registration date from Jan-2017 onwards from data.gov.sg',
  theme = theme(plot.title = element_text(face = "bold", size = 18))
)

Insights

As noted above, the hypothesis testing is done using Kruskal-Wallis test with 95% confidence level for the individual plot. For each plot, they hypothesis is:

  • H0 : There is no difference between median resale price per flat area across towns

  • H1 : There is difference between median resale price per flat area across towns

Since the p-value < critical value of 0.05, there is statistical evidence to reject the null hypothesis. We can conclude that for each set of towns (grouped by number of transactions) that there is difference between median resale price per flat area across the towns.

In both plots, it can be observed that there are few pairs of towns with p-value > 0.05 such as Jurong West and Choa Chu Kang. This suggests that the differences between the medians of the pair are not statistically significant.

We can also observe that there is overall higher median and higher number of towns with signficiant outliers (i.e., Central Area, Queenstown, Toa Payoh, Bishan, Clementi) in the second plot (towns with < 850 transactions).

Overall, the findings of the EDA is more or less consistent (refer to section 3.1.2 above).

3.3 Visualising the uncertainty of point estimates

It is tempting to interpret point estimate like median as precise representation of the true data value. However, it is worth noting that there are uncertainties surrounding the point estimates, hence the motivation to display the the target quantile confidence levels (95% or 99%) that the true (unknown) estimate would lie within the interval, given the evidence provided by the observed data.

Design Consideration

ggdist::stat_pointinterval() is used to visualise the uncertainty. Considerations :

  • The aim is to provide one-glance of levels of uncertainties of median estimate of resale price by town. Hence there is no need to clutter the plot with too much details (i.e., labels and annotations) and it is possible to present this in a single plot

  • Median point estimates are used instead of mean to adhere with non-parametric testing used above. With this, user can still relate to the above one-way ANOVA analysis

  • With median used as point estimate, quantile interval is used instead of confidence interval

  • 95% and 99% intervals are used as they are commonly associated with 5% and 1% error rate, which are commonly used in hypothesis testing

Preparation of visualisation

ggplot2 and stat_pointinterval() are used to prepare the plot. Steps taken are

  • Use the base ggplot() to determine the data and x, y-variables

  • Define confidence levels in width argument of stat_pointinterval(). Define the .point estimate as median and .interval as qi (quantile intervals). The argument point_color also assigns color to the point.

  • scale_color_manual() of ggplot2 is used to assign colors for the interval lines.

  • Setting the title, subtitle and caption using lab() and theme() of ggplot2

Show the code
#Base ggplot
ggplot(
  data = flatpriceclean,
  aes(x = town,
      y = resale_price_persqm)) +
  
#Using stat_pointinterval to plot the points and intervals
  stat_pointinterval(
    aes(interval_color = stat(level)),
    .width = c(0.95, 0.99),
    .point = median,
    .interval = qi,
    point_color = "darkred",
    show.legend = FALSE) +

#Defining the color of the intervals 
  scale_color_manual(
    values = c("blue2", "darkblue"),
    aesthetics = "interval_color") +
  
#Title, subtitle, and caption
  labs(title = 'Visualising uncertainty in median estimates of resale price by town',
  subtitle = 'Confidence intervals (95% and 99%) of resale price per flat area by Town, 2022',
  caption = 'Resale flat princes based on registration date from Jan-2017 onwards from data.gov.sg',
  x = "Planning Town",
  y = "Resale Price per sqm (SGD)") +
  
  theme_hc() +
  
  theme(plot.title = element_text(face = "bold", size = 18),
        plot.subtitle = element_text(size = 14),
        axis.text.x = element_text(angle = 30, hjust = 1))

Insights

As noted above, some towns like Central Area, Bukit Merah, and Bishan have higher uncertainties. This could be potentially due to large presence of outliers in those towns, more varied flat types or models or simply sampling variability problem. On the opposite spectrum, towns like Choa Chu Kang, Sembawang and Bukit Timah have lower uncertainties, which might indicate lower presence of outliers or more uniform flat types.

3.4 Visualisation of Resale Price vs Remaining Lease by Township

Section 3.1 above allow users to perform EDA to explore the relationship between resale price with other variables. Section 3.2 confirms statistical evidence of difference of resale price by towns. It is hence interesting to combine these insights together in one interactive visualization.

Design Consideration

One of the most interesting finding in EDA of two continuous variables is the relationship between resale price per sqm and remaining lease_years. Unfortunately, without applying any filters, there is very little insight can be garnered from this two-dimensional plot. Additional dimensions could be introduced by applying town filter, which was found to be statistically significant in Section 3.2. Considerations:

  • Interactive bubble plots of resale_price per sqm and remaining_lease_years

  • Dropdown menu to filter by town. Note that introduction of this filter not only helps users to visualize the different aspects of the bubble plot per town, but also reduces cluttering by too many datapoints.

  • The bubbles can be colored by flat_type and its size corresponds to the floor_area_sqm. The flat_type is chosen for color as it has the least number of categories, avoiding cluttering the plot with too many colors. The size corresponds to the floor_area_sqm to align with user’s intuition that bigger bubble means bigger size. While the resale price is already normalized against this, users can also see different angle given with remaining_lease_years in the x-axis. Note that with plot_ly, the color legend can also be selected/deselected to apply additional filter

  • To reduce cluttering further, opacity is introduced with white border

  • Tooltip indicating other important variables : resale_price_persqm, floor_area_sqm, remaining_lease_years, flat_model, and storey_range

Preparation of visualisation

Firstly, each unique instance of town was defined using for-loop. This will create a list within list (town_list) containing the method, args, and label arguments of each town. This list will be used as an input to buttons argument of the updatemenus. The annotation variables (annot) for the dropdown menu was also defined.

Refer to code below for more details

town_list <- list()
for (i in 1:length(unique(flatpriceclean$town))) { 
  town_list[[i]] <- list(method = "restyle",
                         args = list("transforms[0].value",
                                     unique(flatpriceclean$town)[i]),
                         label = unique(flatpriceclean$town)[i])
  }

annot <- list(list(text = "Planning Town:",
                   x = 1.41,
                   y = 0.78,
                   xref = 'paper',
                   yref = 'paper',
                   showarrow = FALSE))

plot_ly() is used to prepare the interactive bubble plot. Steps taken are

  • flatpriceorder variable is created to ensure that there is no clash between the color and transforms argument in plot_ly.

  • Initiating base scatterplot, indicated by type = 'scatter'. marker argument is used to introduce opacity and line (white plot border). Note that color and size arguments are specified to be flat_type and floor_area_sqm respectively.

  • Tooltip customisation is created using hovertemplate argument

  • transform argument is set to filter based on selected town provided in the updatemenus

  • layout argument is used to add plot title, x-axis title, and y-axis title. The range of the axis are also fixed to give users fixed reference frame.

  • To create the dropdown menu to select town, the updatemenus argument is used, passing town_list in the buttons argument. Note that annot is passed to annotations argument as well.

Refer to code below for more details

Show the code
flatpriceorder <- flatpriceclean[order(flatpriceclean$flat_type), ]

#Initiating the base plot
plot_ly(data = flatpriceclean,
        x = ~remaining_lease_years,
        y = ~resale_price_persqm,
        type = 'scatter',
        mode = 'markers',
        size = ~floor_area_sqm,
        sizes = c(5, 15),
        color = ~factor(flat_type),
        marker = list(opacity = 0.6,
                      sizemode = 'diameter',
                      line = list(width = 0.2, color = '#FFFFFF')),
        
#Creating tooltip details
        hovertemplate = ~paste("<br>Resale Price per sqm:", resale_price_persqm,
                               "<br>Floor Area (sqm):", floor_area_sqm,
                               "<br>Remaining Lease:", remaining_lease_years,
                               "<br>Flat Model:", flat_model,
                               "<br>Storey Height:", storey_range),

#Transform plotly by filtering the selected town in updatemenus       
        transforms = list(list(type = 'filter',
                               target = ~flatpriceorder$town,
                               operation = '=',
                               value = unique(flatpriceorder$town)[1])
                          )
        ) |> 

  
#Configuring the plot and axis title and range  
  layout(title = "Resale Price per flat area increases with remaining lease \nResale transactions by towns, 2022",
         xaxis = list(title = "Remaining Lease (Year)",
                      range = c(40, 100)),
         yaxis = list(title = "Resale Price per sqm (SGD)",
                      range = c(3000, 16000)),
         
#Creating the dropdown box for users to filter using town
         updatemenus = list(list(type = 'dropdown',
                                 xref = "paper",
                                 yref = "paper",
                                 x = 1.4, y = 0.7,
                                 buttons = town_list)
                            ),
         annotations = annot
         )
Insights

This new plot is quite insightful as it allows users to understand the relationship between resale price and remaining lease and flat types by towns/estates. Interesting insights:

  • The fixed reference frame (x-axis and y-axis range) allows users to see the lower resale price in Woodlands vs higher resale price in Kallang

  • One can also discern how newer estates like Punggol fetch relatively low resale price (<8kSGD) per sqm despite having > 80 years remaining lease. This is in comparison with Central Area which can fetch up to 14kSGD per sqm for the same remaining lease. This highlights the importance of town centrality in resale price

  • In a more mature estates like Bedok, Toa Payoh, or Queenstown, the older flats tend to be predominantly dominated by 3-room flats, fetching lower resale price. Between 60-80 years remaining lease, they are dominated by 4 & 5-room flats. Interestingly, beyond 80 years remaining lease, there are significant numbers of 3-room flats sold fetching competitive price.